Dev/spark backend new comparator #213
Open
TonyKatkov89 wants to merge 36 commits into dev/spark_backend
fixed set_value
…ation and PandasDataset
…icate code get_values and iget_values from PandasDataset
- Fix __init__ to conditionally initialize physical index based on flag
- Implement iloc with physical_index_actual_flag checking
- Add physical_index_actual_flag=False to loc, sort_values, dropna
- Exclude utility columns from fillna, drop, rename operations
- Use _public_columns in agg, mode, log to avoid utility column processing
- Add warnings when user attempts to modify utility columns

BREAKING CHANGE: iloc now requires physical_index_actual_flag to be True
…ues, from_dict, to_dict, to_records, index. Testing and limitation required.
…park' into dev/spark_backend_new_comparator
# Conflicts:
#   hypex/dataset/backends/pandas_backend.py
#   hypex/dataset/backends/spark_backend.py
# Conflicts:
#   hypex/dataset/abstract.py
#   hypex/dataset/backends/pandas_backend.py
#   hypex/dataset/backends/spark_backend.py
#   hypex/dataset/dataset.py
#   hypex/utils/__init__.py
#   hypex/utils/typings.py
#   tests/test_spark_backend.ipynb
Stats-based comparator for Spark-efficient hypothesis testing
Summary
- `BaseComparator` as a shared root, keeping `GroupsComparator` (former `Comparator`) for raw-data comparisons and adding the new `StatsComparator` for aggregation-based comparisons.
- `StatsComparator`: a two-phase abstract comparator that operates on pre-aggregated sufficient statistics instead of raw data slices. Phase 1 issues a single `.agg()` call across all target columns and groups; Phase 2 runs analytical tests on the returned scalar dicts, entirely driver-side. This reduces Spark jobs from O(columns × groups) to a constant one job per executor.
- `AggTTest`: a concrete `StatsComparator` implementing Welch's t-test from `{mean, var, count}` statistics. Produces the same output shape as `TTest` and is a drop-in replacement in pipelines where raw data transfer is expensive.
- `GroupedDataset.agg` with list input: flatten the Pandas `MultiIndex` produced by list-style aggregation into `{col}┆{stat}` column names, which `StatsComparator.execute` relies on.

Motivation
The existing `Comparator`/`GroupsComparator` pattern pulls raw group data to the driver before running statistical tests. On a Spark backend this causes one separate distributed job per (group pair × column), which is prohibitively slow for wide datasets. `StatsComparator` + `AggTTest` solve this by aggregating in one distributed pass and doing the math locally on small scalar dicts.

Files changed
- `hypex/comparators/abstract.py`: `BaseComparator`, `GroupsComparator` (renamed from `Comparator`), `StatsComparator`
- `hypex/comparators/stats_hypothesis_testing.py`: `AggTTest`
- `hypex/comparators/__init__.py`
- `hypex/dataset/groupby_dataset.py`: `agg()`
- `hypex/utils/constants.py` / `__init__.py`

Test plan
- `GroupsComparator` (formerly `Comparator`) behaviour is unchanged
- `AggTTest` p-values match `TTest` on the same dataset
- `GroupedDataset.agg` with a list of stat functions produces flat `col┆stat` column names
- `StatsComparator`/`AggTTest` against a Spark-backed `ExperimentData`
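
For reviewers who want to sanity-check the two-phase idea outside HypEx, here is a minimal, stdlib-only sketch. It is not the code in this PR: `aggregate`, `welch_from_stats`, and `SEP` are hypothetical names, plain Python lists stand in for the single distributed `.agg()` call, and the sketch stops at the t statistic and the Welch-Satterthwaite degrees of freedom (a p-value would additionally need the Student's t distribution, which the standard library does not provide).

```python
import math
from statistics import mean, variance

# Separator assumed from the "{col}┆{stat}" naming described in the summary.
SEP = "┆"


def aggregate(groups, columns):
    """Phase 1 stand-in: build one flat {col}┆{stat} dict per group.

    On Spark this would be a single aggregation job; in-memory lists are
    used here only to keep the sketch self-contained.
    """
    flat = {}
    for group, rows in groups.items():
        stats = {}
        for col in columns:
            values = [row[col] for row in rows]
            stats[f"{col}{SEP}mean"] = mean(values)
            stats[f"{col}{SEP}var"] = variance(values)  # sample variance (n - 1)
            stats[f"{col}{SEP}count"] = len(values)
        flat[group] = stats
    return flat


def welch_from_stats(a, b, col):
    """Phase 2: Welch's t statistic and degrees of freedom from scalar stats."""
    m1, v1, n1 = (a[f"{col}{SEP}{s}"] for s in ("mean", "var", "count"))
    m2, v2, n2 = (b[f"{col}{SEP}{s}"] for s in ("mean", "var", "count"))
    se1, se2 = v1 / n1, v2 / n2  # squared standard errors of the two means
    t = (m1 - m2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Toy data: two groups, one target column.
groups = {
    "control": [{"x": v} for v in (1.0, 2.0, 3.0, 4.0, 5.0)],
    "test": [{"x": v} for v in (2.0, 4.0, 6.0, 8.0, 10.0)],
}

stats = aggregate(groups, ["x"])
t, df = welch_from_stats(stats["control"], stats["test"], "x")
print(stats["control"])
print(t, df)
```

The point of the sketch is that Phase 2 only touches three scalars per (group, column), so its cost does not depend on the number of rows; everything that scales with the data stays in Phase 1.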